Overview

We explore the performance of text-davinci-003 and all Llama-2 models (both base and chat) on experiments 1-3 from Degano et al 2024, experiments 1-2 from Marty et al 2023 and experiments 4-6 from Marty et al 2022. All practice trials are used as a few-shot prompt (i.e., they are presented together with their correct solutions).

The results were obtained by retrieving the log probability of each of the labels “good” / “bad” following the prompt. The selected answer is the label with the higher log probability. For simplicity, the prompts are linked here:

Analysis

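Below we analyse the selected response. We apply two analyses: a winner-takes-all (WTA) analysis, in which the label with the higher log probability counts as the chosen response on each trial, and a probability-based analysis, in which the two log probabilities are renormalized to trial-level probabilities and the probability of “good” is averaged.

In our other studies, the WTA approach has shown a better fit to human data. However, based on the plots below, the probability-based analysis might be better suited to these datasets.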

# requires dplyr (loaded e.g. via library(tidyverse))
process_response <- function(d) {
  d <- d %>%
    rowwise() %>%
    mutate(
      # log probability of the winning label
      chosen_response_llh = max(Mean_logprob_answer_good, Mean_logprob_answer_bad),
      # WTA response; ties are resolved in favour of "good"
      chosen_response = ifelse(chosen_response_llh == Mean_logprob_answer_good, "good", "bad"),
      # renormalize over the two labels so that prob_good + prob_bad = 1
      norm_factor = sum(exp(Mean_logprob_answer_good), exp(Mean_logprob_answer_bad)),
      prob_good = exp(Mean_logprob_answer_good) / norm_factor,
      prob_bad = exp(Mean_logprob_answer_bad) / norm_factor
    ) %>%
    ungroup()  # drop the rowwise grouping before returning
  return(d)
}
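
To make the two measures concrete, here is a minimal base-R sketch on made-up log probabilities. The `toy` data frame and its values are assumptions for illustration, not model output. It mirrors the renormalization in process_response, i.e. prob_good = exp(l_good) / (exp(l_good) + exp(l_bad)), without requiring the tidyverse.

```r
# Toy trials with invented mean log probabilities for the two labels.
toy <- data.frame(
  Mean_logprob_answer_good = c(-0.5, -2.0, -1.0),
  Mean_logprob_answer_bad  = c(-1.5, -0.7, -1.0)
)

# WTA: the label with the higher log probability wins; ties go to "good".
toy$chosen_response <- ifelse(
  toy$Mean_logprob_answer_good >= toy$Mean_logprob_answer_bad, "good", "bad"
)

# Probability-based: renormalize over the two labels so that they sum to 1.
norm_factor <- exp(toy$Mean_logprob_answer_good) + exp(toy$Mean_logprob_answer_bad)
toy$prob_good <- exp(toy$Mean_logprob_answer_good) / norm_factor
toy$prob_bad  <- exp(toy$Mean_logprob_answer_bad) / norm_factor

print(toy)
```

On the third toy trial the two labels are tied, so WTA reports "good" while the probability-based measure reports 0.5. This graded information is what the WTA measure discards.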

We apply these analyses study by study and then aggregate the resulting statistics by the conditions of the respective study.

degano2024_processed <- process_response(degano2024)
marty2023_processed <- process_response(marty2023)
marty2022_processed <- process_response(marty2022)

degano2024_acc_rate <- degano2024_processed %>% 
  mutate(
    is_good = as.numeric(chosen_response == "good")
  ) %>% 
  group_by(model, Condition, Experiment) %>% 
  tidyboot_mean(column = is_good)
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` instead.
## ℹ The signature and semantics have changed, see `?as_tibble`.
## ℹ The deprecated feature was likely used in the purrr package.
##   Please report the issue at <https://github.com/tidyverse/purrr/issues>.
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(strap)`
degano2024_acc_rate
## # A tibble: 105 × 8
## # Groups:   model, Condition [35]
##    model               Condition Experiment     n empiri…¹ ci_lo…²  mean ci_up…³
##    <chr>               <chr>     <chr>      <int>    <dbl>   <dbl> <dbl>   <dbl>
##  1 Llama-2-13b-chat-hf Bad       Exp_1         18        0       0     0       0
##  2 Llama-2-13b-chat-hf Bad       Exp_2         18        0       0     0       0
##  3 Llama-2-13b-chat-hf Bad       Exp_3         18        0       0     0       0
##  4 Llama-2-13b-chat-hf Good      Exp_1         15        0       0     0       0
##  5 Llama-2-13b-chat-hf Good      Exp_2         15        0       0     0       0
##  6 Llama-2-13b-chat-hf Good      Exp_3         15        0       0     0       0
##  7 Llama-2-13b-chat-hf Good-Excl Exp_1          3        0       0     0       0
##  8 Llama-2-13b-chat-hf Good-Excl Exp_2          3        0       0     0       0
##  9 Llama-2-13b-chat-hf Good-Excl Exp_3          3        0       0     0       0
## 10 Llama-2-13b-chat-hf Target_1  Exp_1          6        0       0     0       0
## # … with 95 more rows, and abbreviated variable names ¹​empirical_stat,
## #   ²​ci_lower, ³​ci_upper
marty2023_acc_rate <- marty2023_processed %>% 
  mutate(
    is_good = as.numeric(chosen_response == "good")
  ) %>% 
  group_by(model, Quantifier_type, Condition, Experiment) %>% 
  tidyboot_mean(column = is_good)
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(strap)`
marty2023_acc_rate
## # A tibble: 112 × 9
## # Groups:   model, Quantifier_type, Condition [56]
##    model             Quant…¹ Condi…² Exper…³     n empir…⁴ ci_lo…⁵  mean ci_up…⁶
##    <chr>             <chr>   <chr>   <chr>   <int>   <dbl>   <dbl> <dbl>   <dbl>
##  1 Llama-2-13b-chat… Modal   Bad     Exp_1      18       0       0     0       0
##  2 Llama-2-13b-chat… Modal   Bad     Exp_2      18       0       0     0       0
##  3 Llama-2-13b-chat… Modal   Good    Exp_1      18       0       0     0       0
##  4 Llama-2-13b-chat… Modal   Good    Exp_2      18       0       0     0       0
##  5 Llama-2-13b-chat… Modal   Target… Exp_1       6       0       0     0       0
##  6 Llama-2-13b-chat… Modal   Target… Exp_2       6       0       0     0       0
##  7 Llama-2-13b-chat… Modal   Target… Exp_1       6       0       0     0       0
##  8 Llama-2-13b-chat… Modal   Target… Exp_2       6       0       0     0       0
##  9 Llama-2-13b-chat… Nominal Bad     Exp_1      18       0       0     0       0
## 10 Llama-2-13b-chat… Nominal Bad     Exp_2      18       0       0     0       0
## # … with 102 more rows, and abbreviated variable names ¹​Quantifier_type,
## #   ²​Condition, ³​Experiment, ⁴​empirical_stat, ⁵​ci_lower, ⁶​ci_upper
marty2022_acc_rate <- marty2022_processed %>%
  mutate(
    is_good = as.numeric(chosen_response == "good")
  ) %>% 
  group_by(model, Negation, Polarity, Inference_type, Condition, Experiment) %>% 
  tidyboot_mean(column = is_good)
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(strap)`
marty2022_acc_rate
## # A tibble: 504 × 11
## # Groups:   model, Negation, Polarity, Inference_type, Condition [504]
##    model     Negat…¹ Polar…² Infer…³ Condi…⁴ Exper…⁵     n empir…⁶ ci_lo…⁷  mean
##    <chr>     <chr>   <chr>   <chr>   <chr>   <chr>   <int>   <dbl>   <dbl> <dbl>
##  1 Llama-2-… High    Negati… DIST    Bad     Exp_4       3   1           1 1    
##  2 Llama-2-… High    Negati… DIST    Good    Exp_4       3   1           1 1    
##  3 Llama-2-… High    Negati… DIST    Target  Exp_4       3   0.667       0 0.675
##  4 Llama-2-… High    Negati… FC      Bad     Exp_4       3   1           1 1    
##  5 Llama-2-… High    Negati… FC      Good    Exp_4       3   0.667       0 0.658
##  6 Llama-2-… High    Negati… FC      Target  Exp_4       3   0.667       0 0.686
##  7 Llama-2-… High    Negati… II      Bad     Exp_4       3   1           1 1    
##  8 Llama-2-… High    Negati… II      Good    Exp_4       3   1           1 1    
##  9 Llama-2-… High    Negati… II      Target  Exp_4       3   0.667       0 0.667
## 10 Llama-2-… High    Negati… SI      Bad     Exp_4       3   0.667       0 0.670
## # … with 494 more rows, 1 more variable: ci_upper <dbl>, and abbreviated
## #   variable names ¹​Negation, ²​Polarity, ³​Inference_type, ⁴​Condition,
## #   ⁵​Experiment, ⁶​empirical_stat, ⁷​ci_lower

The processed data is saved as the raw materials’ CSVs extended with the model results: the new columns prob_good, prob_bad and chosen_response, along with the information about which model was used to produce the results. The columns “prob_good” and “prob_bad” contain the trial-level probability of each option, while “chosen_response” contains the response selected by the argmax over prob_good and prob_bad (i.e., the WTA strategy).
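
As a rough illustration of this layout, the base-R sketch below writes a toy processed table to a CSV and reads it back. The data values and the output path are hypothetical; the actual file names used for the saved results are not given in this report.

```r
# Toy stand-in for one processed data set (values invented).
processed <- data.frame(
  model = "Llama-2-13b-chat-hf",
  Condition = c("Good", "Bad"),
  prob_good = c(0.62, 0.31),
  prob_bad = c(0.38, 0.69),
  chosen_response = c("good", "bad")
)

# Hypothetical output path; a temporary file keeps the sketch side-effect free.
out_path <- tempfile(fileext = ".csv")
write.csv(processed, out_path, row.names = FALSE)

# Read the file back to check that the new columns survive the round trip.
reloaded <- read.csv(out_path, stringsAsFactors = FALSE)
print(names(reloaded))
```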

Plots

Below, we plot the mean acceptance rate (i.e., the mean proportion of judgments that a trigger sentence is good) by model, by experiment and by condition.

WTA approach

Plot for Degano et al 2024:

degano2024_acc_rate %>% 
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper)) +
  geom_col() +
  geom_errorbar(width = 0.1) +
  facet_wrap(model~Experiment, ncol = 3) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

Marty et al 2023:

marty2023_acc_rate %>% 
  ggplot(., aes(x = Condition, y = mean, fill = Quantifier_type, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(model~Experiment) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

Marty et al 2022:

# by experiment
e4 <- marty2022_acc_rate %>% 
  filter(Experiment == "Exp_4") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type~model, nrow = 4) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

e5 <- marty2022_acc_rate %>% 
  filter(Experiment == "Exp_5") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type~model, nrow = 4) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30))

e6 <- marty2022_acc_rate %>% 
  filter(Experiment == "Exp_6") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type~model, nrow = 4) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

# bind the plots together
gridExtra::grid.arrange(e4, e5, e6)

Average probability

Now the same plots are replicated with the average probability assigned to the “acceptance” (i.e., “good”) response.
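
The code that builds the `*_prob` summaries is not shown in this report. A plausible reading, and only an assumption, is that it mirrors the acceptance-rate code with prob_good in place of is_good. The base-R sketch below illustrates the underlying idea on invented data: a percentile bootstrap of the mean of prob_good within each condition, reporting the same five summary columns that appear in the tables of this report. The helper `boot_mean_ci`, the number of resamples and the toy values are hypothetical.

```r
# Percentile bootstrap for the mean of a numeric vector (hypothetical helper).
boot_mean_ci <- function(x, nboot = 1000, level = 0.95) {
  # Resample with replacement and record the mean of each resample.
  boot_means <- replicate(nboot, mean(sample(x, size = length(x), replace = TRUE)))
  alpha <- (1 - level) / 2
  c(
    n = length(x),
    empirical_stat = mean(x),
    ci_lower = unname(quantile(boot_means, alpha)),
    mean = mean(boot_means),
    ci_upper = unname(quantile(boot_means, 1 - alpha))
  )
}

set.seed(1)  # make the resampling reproducible

# Invented trial-level probabilities of "good" for two conditions.
toy <- data.frame(
  Condition = rep(c("Good", "Bad"), each = 6),
  prob_good = c(0.61, 0.58, 0.66, 0.70, 0.63, 0.59,
                0.31, 0.35, 0.28, 0.40, 0.33, 0.30)
)

# One bootstrap summary per condition, stacked into a matrix.
by_condition <- lapply(split(toy$prob_good, toy$Condition), boot_mean_ci)
toy_prob <- do.call(rbind, by_condition)
print(round(toy_prob, 3))
```

With only a handful of trials per cell, as in several conditions here (n = 3 or n = 6), bootstrap intervals of this kind are coarse and should be read with caution.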

## # A tibble: 105 × 8
## # Groups:   model, Experiment [21]
##    model               Experiment Condition     n empiri…¹ ci_lo…²  mean ci_up…³
##    <chr>               <chr>      <chr>     <int>    <dbl>   <dbl> <dbl>   <dbl>
##  1 Llama-2-13b-chat-hf Exp_1      Bad          18    0.235   0.212 0.235   0.257
##  2 Llama-2-13b-chat-hf Exp_1      Good         15    0.219   0.201 0.219   0.237
##  3 Llama-2-13b-chat-hf Exp_1      Good-Excl     3    0.213   0.198 0.213   0.222
##  4 Llama-2-13b-chat-hf Exp_1      Target_1      6    0.250   0.221 0.251   0.287
##  5 Llama-2-13b-chat-hf Exp_1      Target_2      6    0.209   0.174 0.209   0.252
##  6 Llama-2-13b-chat-hf Exp_2      Bad          18    0.239   0.221 0.239   0.257
##  7 Llama-2-13b-chat-hf Exp_2      Good         15    0.230   0.213 0.230   0.246
##  8 Llama-2-13b-chat-hf Exp_2      Good-Excl     3    0.215   0.179 0.215   0.257
##  9 Llama-2-13b-chat-hf Exp_2      Target_1      6    0.230   0.199 0.229   0.259
## 10 Llama-2-13b-chat-hf Exp_2      Target_2      6    0.222   0.185 0.222   0.258
## # … with 95 more rows, and abbreviated variable names ¹​empirical_stat,
## #   ²​ci_lower, ³​ci_upper
## # A tibble: 112 × 9
## # Groups:   model, Quantifier_type, Condition [56]
##    model             Quant…¹ Condi…² Exper…³     n empir…⁴ ci_lo…⁵  mean ci_up…⁶
##    <chr>             <chr>   <chr>   <chr>   <int>   <dbl>   <dbl> <dbl>   <dbl>
##  1 Llama-2-13b-chat… Modal   Bad     Exp_1      18   0.226   0.199 0.226   0.252
##  2 Llama-2-13b-chat… Modal   Bad     Exp_2      18   0.284   0.259 0.285   0.312
##  3 Llama-2-13b-chat… Modal   Good    Exp_1      18   0.224   0.197 0.225   0.253
##  4 Llama-2-13b-chat… Modal   Good    Exp_2      18   0.251   0.223 0.251   0.279
##  5 Llama-2-13b-chat… Modal   Target… Exp_1       6   0.248   0.193 0.246   0.296
##  6 Llama-2-13b-chat… Modal   Target… Exp_2       6   0.287   0.222 0.287   0.348
##  7 Llama-2-13b-chat… Modal   Target… Exp_1       6   0.222   0.186 0.223   0.268
##  8 Llama-2-13b-chat… Modal   Target… Exp_2       6   0.256   0.228 0.255   0.278
##  9 Llama-2-13b-chat… Nominal Bad     Exp_1      18   0.227   0.197 0.227   0.259
## 10 Llama-2-13b-chat… Nominal Bad     Exp_2      18   0.266   0.235 0.266   0.301
## # … with 102 more rows, and abbreviated variable names ¹​Quantifier_type,
## #   ²​Condition, ³​Experiment, ⁴​empirical_stat, ⁵​ci_lower, ⁶​ci_upper
## # A tibble: 504 × 11
## # Groups:   model, Negation, Polarity, Inference_type, Condition [504]
##    model     Negat…¹ Polar…² Infer…³ Condi…⁴ Exper…⁵     n empir…⁶ ci_lo…⁷  mean
##    <chr>     <chr>   <chr>   <chr>   <chr>   <chr>   <int>   <dbl>   <dbl> <dbl>
##  1 Llama-2-… High    Negati… DIST    Bad     Exp_4       3   0.594   0.526 0.594
##  2 Llama-2-… High    Negati… DIST    Good    Exp_4       3   0.580   0.537 0.582
##  3 Llama-2-… High    Negati… DIST    Target  Exp_4       3   0.568   0.438 0.565
##  4 Llama-2-… High    Negati… FC      Bad     Exp_4       3   0.575   0.551 0.576
##  5 Llama-2-… High    Negati… FC      Good    Exp_4       3   0.567   0.436 0.571
##  6 Llama-2-… High    Negati… FC      Target  Exp_4       3   0.568   0.484 0.566
##  7 Llama-2-… High    Negati… II      Bad     Exp_4       3   0.530   0.5   0.531
##  8 Llama-2-… High    Negati… II      Good    Exp_4       3   0.634   0.613 0.635
##  9 Llama-2-… High    Negati… II      Target  Exp_4       3   0.539   0.481 0.538
## 10 Llama-2-… High    Negati… SI      Bad     Exp_4       3   0.528   0.423 0.528
## # … with 494 more rows, 1 more variable: ci_upper <dbl>, and abbreviated
## #   variable names ¹​Negation, ²​Polarity, ³​Inference_type, ⁴​Condition,
## #   ⁵​Experiment, ⁶​empirical_stat, ⁷​ci_lower
degano2024_prob %>% 
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper)) +
  geom_col() +
  geom_errorbar(width = 0.1) +
  facet_wrap(Experiment~model) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

marty2023_prob %>% 
  ggplot(., aes(x = Condition, y = mean, fill = Quantifier_type, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Experiment~model) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

e4_prob <- marty2022_prob %>% 
  filter(Experiment == "Exp_4") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type~model, nrow = 4) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

e5_prob <- marty2022_prob %>% 
  filter(Experiment == "Exp_5") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type~model, nrow = 4) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

e6_prob <- marty2022_prob %>% 
  filter(Experiment == "Exp_6") %>%
  ggplot(., aes(x = Condition, y = mean, fill = Polarity, ymin = ci_lower, ymax = ci_upper)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(position = position_dodge(0.95), width = 0.1) +
  facet_wrap(Inference_type~model, nrow = 4) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

# bind the plots together
gridExtra::grid.arrange(e4_prob, e5_prob, e6_prob)

Explore 0-shot results

An exploration of zero-shot results was also conducted (with Llama 2 7b base, 7b chat, 13b base and 13b chat). The WTA results and trial-level probability results are shown below.

For Degano et al 2024, as an example, we combine the zero-shot and few-shot results for direct comparison:

# stack zero-shot and few-shot WTA summaries, labelled by prompting regime
zero_shot_acc_rate_prompts <- zero_shot_degano_acc_rate %>%
  mutate(prompting = "zero-shot") %>%
  rbind(
    .,
    degano2024_acc_rate %>%
      # filter(model == "Llama-2-13b-chat-hf") %>%
      mutate(prompting = "few-shot")
  )

# same for the average-probability summaries
zero_shot_prob_prompts <- zero_shot_degano_prob %>%
  mutate(prompting = "zero-shot") %>%
  rbind(
    .,
    degano2024_prob %>%
      # filter(model == "Llama-2-13b-chat-hf") %>%
      mutate(prompting = "few-shot")
  )

Average probability:

zero_shot_prob_prompts %>% 
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper, fill = prompting)) +
  geom_col(position=position_dodge()) +
  geom_errorbar(width = 0.1, position = position_dodge(0.95)) +
  facet_wrap(Experiment~model) +
  ylab("Probability of acceptance") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

WTA:

zero_shot_acc_rate_prompts %>% 
  ggplot(., aes(x = Condition, y = mean, ymin = ci_lower, ymax = ci_upper, fill = prompting)) +
  geom_col(position=position_dodge()) +
  geom_errorbar(width = 0.1, position = position_dodge(0.95)) +
  facet_wrap(Experiment~model) +
  ylab("Acceptance rate") +
  theme_csp() +
  theme(axis.text.x = element_text(angle=30)) 

Now we replicate the overview plots from above on the zero-shot data. WTA:

Item-level probability (wide-scope):